Global Data Distribution Weighted Synthetic Oversampling Technique for Imbalanced Learning

Authors

Abstract

Imbalanced learning is a common problem in data mining. In imbalanced datasets, samples are distributed unevenly among the classes. This poses a challenge for standard algorithms, which are designed for balanced class distributions. Although there are various strategies to solve this problem, generating artificial data to achieve a balanced distribution is relatively universal compared with directly modifying specific classification algorithms. The oversampled datasets can be combined with any user-specified algorithm without restrictions. In this paper, we present a novel oversampling method, Global Data Distribution Weighted Synthetic Oversampling Technique (GDDSYN). By applying clustering, GDDSYN optimizes the selection criteria for the minority samples that are used to generate synthetic samples, which avoids generating more noise samples. GDDSYN assigns weights to the number of synthetic samples to tackle within-class imbalance and between-class imbalance simultaneously, according to the informative level of each sample and the sparsity of the cluster to which it belongs. The use of Silhouette Coefficient and Mutual Information scores helps k-means set a reasonable number of clusters for the minority and majority classes, respectively, so that the clustering effect is guaranteed. Next, by using the clustering information, the synthetic samples' generation path is improved to avoid overlap. Additionally, GDDSYN has been evaluated extensively on 10 real-world data sets. The empirical results show that our method outperforms, or is comparable with, some existing methods in terms of assessment metrics when the generated samples are used.
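The cluster-weighted generation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the minority class has already been clustered, so the k-means step and the Silhouette Coefficient and Mutual Information scoring are omitted. It uses the mean pairwise distance as a stand-in sparsity measure, which is an assumption and not the formula from the paper. The function names `sparsity`, `allocate`, and `oversample` are hypothetical.

```python
import random
from math import dist


def sparsity(cluster):
    """Mean pairwise Euclidean distance of a cluster (assumed sparsity measure)."""
    n = len(cluster)
    if n < 2:
        return 0.0
    total = sum(dist(cluster[i], cluster[j])
                for i in range(n) for j in range(i + 1, n))
    return total / (n * (n - 1) / 2)


def allocate(clusters, n_new):
    """Split n_new synthetic samples among clusters in proportion to sparsity."""
    weights = [sparsity(c) for c in clusters]
    total = sum(weights)
    if total == 0:
        # Degenerate case: fall back to equal weights.
        weights = [1.0] * len(clusters)
        total = float(len(clusters))
    counts = [int(n_new * w / total) for w in weights]
    # Give the samples lost to rounding down to the sparsest clusters first.
    order = sorted(range(len(clusters)), key=lambda i: weights[i], reverse=True)
    for i in order[: n_new - sum(counts)]:
        counts[i] += 1
    return counts


def oversample(clusters, n_new, rng):
    """Interpolate between two points of the same cluster for each new sample."""
    synthetic = []
    for cluster, count in zip(clusters, allocate(clusters, n_new)):
        for _ in range(count):
            a = rng.choice(cluster)
            b = rng.choice(cluster)
            t = rng.random()  # position along the segment from a to b
            synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


# Two hypothetical minority clusters: a dense one and a sparse one.
clusters = [
    [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    [(10.0, 10.0), (14.0, 10.0), (10.0, 14.0)],
]
counts = allocate(clusters, 10)
synthetic = oversample(clusters, 10, random.Random(0))
```

Because both endpoints of each interpolation come from the same cluster, every synthetic point stays inside that cluster's convex hull. This is one simple way to keep generated samples from falling between clusters, and the sparser cluster receives the larger share of new samples.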

Similar resources

Adaptive Oversampling for Imbalanced Data Classification

Data imbalance is known to significantly hinder the generalization performance of supervised learning algorithms. A common strategy to overcome this challenge is synthetic oversampling, where synthetic minority class examples are generated to balance the distribution between the examples of the majority and minority classes. We present a novel adaptive oversampling algorithm, VIRTUAL, that comb...

Iterative Nearest Neighborhood Oversampling in Semisupervised Learning from Imbalanced Data

Transductive graph-based semisupervised learning methods usually build an undirected graph utilizing both labeled and unlabeled samples as vertices. Those methods propagate label information of labeled samples to neighbors through their edges in order to get the predicted labels of unlabeled samples. Most popular semi-supervised learning approaches are sensitive to initial label distribution wh...

A Study of Synthetic Oversampling for Twitter Imbalanced Sentiment Analysis

The majority of Twitter sentiment analysis systems implicitly assume that the class distribution is balanced while in practice it is usually skewed. We argue that Twitter opinion mining using learning methods should be addressed in the framework of imbalanced learning. In this work, we present a study of synthetic oversampling techniques for tweet-polarity classification. The experiments we con...

A Classification Model for Imbalanced Medical Data based on PCA and Farther Distance based Synthetic Minority Oversampling Technique

Medical data are extensively used in the diagnosis of human health. They therefore play a vital role for physicians as well as in medical engineering. Accordingly, much research is under way to better predict diseases or to improve diagnostic quality. However, most researchers work on either dimensionality space or imbalanced data. Due to this, s...

A Synthetic Minority Oversampling Method Based on Local Densities in Low-Dimensional Space for Imbalanced Learning

Imbalanced class distribution is a challenging problem in many real-life classification problems. Existing synthetic oversampling methods suffer from the curse of dimensionality because they rely heavily on Euclidean distance. This paper proposes a new method, called Minority Oversampling Technique based on Local Densities in Low-Dimensional Space (MOT2LD for short). MOT2LD first maps each trainin...

Journal

Journal title: IEEE Access

Year: 2021

ISSN: 2169-3536

DOI: https://doi.org/10.1109/access.2021.3067060